Abstract

XML, Web services, and the Semantic Web have opened the door for
new and exciting information integration applications. Information
sources on the web are controlled by different organizations or
people, utilize different text formats, and have varying
inconsistencies. Therefore, any system that integrates information
from different data sources must identify common entities from
these sources. Data from many online sources does not contain
enough information to accurately link the records using
state-of-the-art record linkage systems. There is an inherent need
for learning in these systems, most of the time requiring a user in
the loop, to accurately link records across datasets. In this paper
we describe a novel approach to exploiting additional data sources
to design an unsupervised record linkage method. Our evaluation
using real-world data sets shows that the performance of
unsupervised learning in a record linkage system is on par with
traditional supervised learning methods.

1  Introduction 

In the recent past, researchers have developed various machine
learning techniques such as SoftMealy [7] and Stalker [13] to easily
extract structured data from various web sources. Using those
techniques, users can build wrappers that allow them to easily query
web sources much like databases. Web-based information integration
systems such as Information Manifold [15], InfoMaster [4], and
Ariadne [9] can provide a uniform query interface for users to query
information from various web sources as well as databases. While the
above-mentioned systems can integrate information from various data
sources, none of them completely addresses the issues relating to
textual inconsistencies across several data sources. For example,
two restaurant web sites may refer to the same address using
different textual information. Therefore, record linkage is
essential to accurately integrate data from various data sources.

Some prior work links records from various web sites using textual
similarities and transformations [1, 2, 3, 17]. These approaches
provide better consolidation results than exact text matching
techniques in different application domains. However, all of these
systems rely on a fair amount of user interaction, whether it be in
labeling match pairs [1, 17], creating reference and input tables
[2], or designing domain-specific profilers using a domain expert’s
knowledge [3]. By incorporating additional data sources into the
loop, we can use them to label record pairs. There are many
application areas where information from additional data sources can
provide important domain knowledge. Examples include utilizing a
geocoder to determine if two addresses are the same, utilizing
historical area code changes to determine if two phone numbers are
the same, and utilizing the location and officer information for
different companies to determine if two companies are the same.

The goal of our research is to provide a framework for accurately
linking records across data sources in an unsupervised manner. In
our previous work [11, 10], we showed how primary sources can be
augmented with data obtained from secondary sources to improve the
record linkage process. In this paper, we present an extension to
Apollo’s active learning component to address the issue of user
involvement. Using secondary sources, a system can autonomously
answer questions posed by its active learning component. By
diverting questions to the system itself, the entire record linkage
process minimizes the involvement of a user. In the extended Apollo
system, user involvement is limited to a preprocessing step used to
evaluate secondary sources.

Figure 1: Textual Inconsistencies Across Data Sources

In presenting our approach, we will provide a motivating example,
followed by some background work on record linkage. Subsequently, we
present an analysis of our methodology for unsupervised learning
using secondary sources. We will then present the evaluation of our
approach using real-world restaurant data sets with both supervised
and unsupervised learning techniques. Finally, we will discuss
related work and put forward our conclusions and planned future
work.

2  Motivating  Example 

To clarify the concepts presented in this article, we define the
following terms: (1) Record linkage, (2) Primary data sources, and
(3) Secondary data sources. Record linkage is the process of
determining if two records refer to the same entity. A primary data
source is one of the two initial data sources used for record
linkage. A secondary data source is any source, other than a primary
data source, that can provide additional information about entities
in the primary data sources. Consider the following primary data
sources:

•  Zagat and Dinesite data sources that provide information about
various restaurants.

•  Travelocity and Orbitz data sources that provide information
about various hotels.

•  Yahoo and Moviefone data sources that provide information about
various theaters.

When the user sends a request to obtain information pertaining to
restaurants within a given city, the record linkage system needs to
link records that refer to the same restaurant from the Zagat and
Dinesite data sources. However, due to the textual inconsistencies
present in both data sources, determining which records refer to a
common entity is a non-trivial task. Figure 1 shows the varying
textual inconsistencies found in the restaurant data sources. A
similar situation arises when attempting to combine information
about hotels from Travelocity and Orbitz, or about movies from Yahoo
and Moviefone.

Furthermore, when linking records across data sources, a system
should perform this task as autonomously as possible. It is tedious
for a user to label a large number of record pairs as matches or
non-matches, so it is desirable for a system to accomplish this task
with minimal involvement from the user. However, for such a system
to be successful, it must be as good as if a user were in the loop.
Therefore, it is important to maintain accuracy while introducing
autonomy. This accuracy is maintained by using specialized matching
techniques found in secondary sources rather than encoding all
possible matching techniques in the system itself.

3  Previous  Work 

In this section, we present the Active Atlas [16, 17] and Apollo
[11, 10] record linkage systems. Active Atlas is the foundation upon
which the Apollo system is built. Apollo automatically augments
primary sources with secondary source information to improve the
record linkage process. Its robust, extensible framework makes it an
ideal candidate for a base system upon which to build an
unsupervised record linkage system.

3.1  Active Atlas Overview

Active Atlas’ architecture consists of two separate components: a
candidate generator and a mapping learner. Its goal is to find
common entities between two record sets from the same domain. The
candidate generator proposes a set of potential matches based on the
transformations available to the system. A transformation may be one
of a number of string comparison types such as equality, substring,
prefix, suffix, stemming, or others; all transformations are
weighted equally when computing similarity scores for potential
matches. Once the candidate generator has finished proposing
potential matches, Active Atlas moves on to the second stage and
uses the potential matches as the basis for learning mapping rules
and transformation weights.
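
The candidate generation step can be illustrated with a minimal sketch. It assumes token-level comparisons and only three transformations, and all names (TRANSFORMATIONS, similarity, is_candidate) and the 0.5 threshold are hypothetical choices made for illustration; they are not taken from Active Atlas itself.

```python
from typing import Callable, Dict

# Hypothetical token-level transformations; each returns True when it
# explains how token a relates to token b.
TRANSFORMATIONS: Dict[str, Callable[[str, str], bool]] = {
    "equality": lambda a, b: a == b,
    "prefix": lambda a, b: a != b and (a.startswith(b) or b.startswith(a)),
    "substring": lambda a, b: a != b and (a in b or b in a),
}


def similarity(field_a: str, field_b: str) -> float:
    """Fraction of tokens in field_a related to some token in field_b.

    Every transformation counts equally, mirroring the initial state in
    which no transformation weights have been learned yet.
    """
    tokens_a = field_a.lower().split()
    tokens_b = field_b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    matched = 0
    for a in tokens_a:
        if any(t(a, b) for b in tokens_b for t in TRANSFORMATIONS.values()):
            matched += 1
    return matched / len(tokens_a)


def is_candidate(field_a: str, field_b: str, threshold: float = 0.5) -> bool:
    """Propose a pair as a potential match when its score clears a threshold."""
    return similarity(field_a, field_b) >= threshold
```

With these assumptions, "Art's Deli" and "Art's Delicatessen" score 1.0 (one token matches by equality, the other by prefix), so the pair would be proposed as a potential match for the mapping learner to confirm or reject.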

The mapping learner establishes which of the potential matches are
correct by adapting the mapping rules and transformation weights to
the specific domain. Because the initial similarity scores are very
inaccurate, the system uses an active learning approach to refine
and improve the transformation weights and mapping rules. This
approach uses a decision tree committee model for learning with
three members in the committee. The mapping learner selects the most
informative potential match and asks the user to label this example
as either a match or non-match. The user’s response is used to
refine and recalculate the transformation weights, learn new mapping
rules, and reclassify record pairs. This process continues until:
(1) the committee learners converge and agree on one decision tree,
or (2) the user has been asked a pre-defined number of questions.
Once the mapping rules and transformation weights have been learned,
Active Atlas uses them to classify all the potential matches in the
system as matched or not matched. The results are then made
available to the user.


Algorithm: ApolloLearner(LE, SS, N, AllCM)

procedure EvaluateSS(LE, SS)
    GoodSS ← ∅
    for each src ∈ SS
        do  #true ← 0
            #false ← 0
            for each example ∈ LE
                do  label ← Label(src, example)
                    if label
                        then #true ← #true + 1
                        else #false ← #false + 1
            if (#true / (#true + #false)) > 0.75
                then GoodSS ← GoodSS ∪ src
    return (GoodSS)

procedure Learn(GoodSS, LE, N, AllCM)
    nLEsets ← DivideNSets(LE, N)
    for each set ∈ nLEsets
        do  dt ← LearnDT(set)
            labels[set] ← Classify(dt, AllCM)
    nextExp ← GetInformativeExp(labels)
    if nextExp not ∅
        then for each exp ∈ nextExp
                 do LE ← LE ∪ Label(GoodSS, exp)
             Learn(GoodSS, LE, N, AllCM)
    return (labels)

main
    GoodSS ← EvaluateSS(LE, SS)
    Learn(GoodSS, LE, N, AllCM)

3.2  Automatically Augmenting Primary Data Sources

Current record linkage systems [1, 3, 6, 8, 12, 15, 17] excel at
learning how to weigh attributes and link records across data
sources. Using machine learning techniques such as decision trees
[14] or Bayesian networks [5], they are able to determine which
attributes are most relevant to consider when trying to match
records across different data sources.

The Apollo system [11, 10] incorporates secondary sources into the
record linkage process to improve its performance. It leverages
decision tree technology to improve the accuracy of the record
linkage process by augmenting primary sources with information
obtained from secondary sources. It utilizes an information mediator
to determine if there are any available secondary data sources for
the given primary sources. If there are one or more available
secondary sources, it augments the primary data sources with
additional attribute(s). Labeled examples provided by a user allow
the system to determine if the newly added attributes are
informative enough to incorporate into the mapping rules. With the
flexible nature of the record linkage framework used in Apollo,
incorporating this additional information from secondary sources is
an easy and efficient process. Also, as shown in [10], this approach
leads to an improvement in both precision and recall.

Figure 2: Apollo’s Unsupervised Learning Algorithm


4  Utilizing Secondary Sources For Unsupervised Learning

In this section we describe Apollo’s approach to identifying
secondary sources that can be used to accurately classify record
pairs as matches or non-matches. Moreover, we present how Apollo
utilizes the identified secondary sources in an unsupervised active
learning process. Apollo’s learning algorithm for using secondary
data sources is presented in Figure 2. There are two main procedures
in the algorithm: (1) Evaluating secondary sources, and (2)
Automatically labeling candidate record pairs.


4.1  Evaluating  Secondary  Sources 

There exists a wide variety of potentially useful secondary sources.
While most secondary sources provide pertinent information, not all
of them can be used to identify matched records. For example, a
company may have multiple locations; therefore, if two company
records have different location attribute values, this does not
imply that the records do not refer to the same company.

In Apollo, we provide a simple mechanism to evaluate the
capabilities of various secondary sources with respect to labeling
matched records. As shown in the EvaluateSS procedure in Figure 2,
we begin with a very small set of training data (e.g. 25 record
pairs) and a set of available secondary sources. Next, we use each
secondary source to label all the record pairs in the training set.
Based on the user-provided labels and the labels given by the
secondary sources, we calculate the percentage of record pairs that
are labeled correctly by the secondary source. If the secondary
source can classify more than 75% (footnote 1) of the record pairs
correctly, we classify it as being useful for labeling.
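
The evaluation step described above can be sketched as follows. This is an illustrative reading of the prose (agreement with user-provided labels, 75% cutoff), not Apollo's actual code; the function name, the record layout with a "latlon" field, and the two toy labelers are assumptions standing in for real secondary sources.

```python
from typing import Callable, Dict, List, Tuple

RecordPair = Tuple[dict, dict]
Labeler = Callable[[dict, dict], bool]


def evaluate_secondary_sources(
    labeled_examples: List[Tuple[RecordPair, bool]],
    sources: Dict[str, Labeler],
    threshold: float = 0.75,
) -> List[str]:
    """Return the names of sources whose labels agree with the user's
    labels on more than `threshold` of the training pairs."""
    good = []
    for name, label in sources.items():
        correct = sum(
            1 for (a, b), truth in labeled_examples if label(a, b) == truth
        )
        if labeled_examples and correct / len(labeled_examples) > threshold:
            good.append(name)
    return good


# Toy stand-ins for secondary sources (assumptions, not real services).
def same_coordinates(a: dict, b: dict) -> bool:
    return a["latlon"] == b["latlon"]


def always_match(a: dict, b: dict) -> bool:
    return True


examples = [
    (({"latlon": (34.05, -118.24)}, {"latlon": (34.05, -118.24)}), True),
    (({"latlon": (34.05, -118.24)}, {"latlon": (34.10, -118.30)}), False),
    (({"latlon": (33.99, -118.47)}, {"latlon": (33.99, -118.47)}), True),
    (({"latlon": (34.02, -118.40)}, {"latlon": (34.07, -118.20)}), False),
]
good = evaluate_secondary_sources(
    examples, {"geocoder": same_coordinates, "always": always_match}
)
```

On this toy data the coordinate-based labeler agrees with every user label and is kept, while the labeler that always answers "match" agrees on only half of the pairs and is rejected.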

4.2  Unsupervised  Active  Learning 

Once Apollo has identified which secondary source can be used to
automatically label record pairs, it can use this secondary source
to generate a set of labeled training examples used by the record
linkage process.

Apollo utilizes an active learning approach described by Tejada et
al. [17] to reduce the number of examples labeled by the secondary
source. As shown in the Learn procedure in Figure 2, it begins
unsupervised active learning by composing a set of labeled examples
(LE) using the selected secondary source (source in GoodSS) from the
matches generated by the candidate generator. It uses a committee of
N decision tree learners to identify the most informative examples
that should be labeled next. This is done by dividing the labeled
example set into N unique sets, and learning a decision tree based
on each set. All the candidate matches (AllCM) are then classified
using the N learned decision trees. The most informative examples
are determined by choosing examples from the candidate set with the
highest level of disagreement between the N decision trees.
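
One simple way to quantify the "level of disagreement" is the size of the minority vote within the committee. The sketch below uses that measure as an assumption, since the text does not specify the exact disagreement metric; the function names and the threshold rules standing in for learned decision trees are likewise hypothetical.

```python
from typing import Callable, List, Sequence

Classifier = Callable[[Sequence[float]], bool]


def disagreement(votes: Sequence[bool]) -> int:
    """Size of the minority vote; 0 means the committee is unanimous."""
    yes = sum(votes)
    return min(yes, len(votes) - yes)


def most_informative(
    committee: List[Classifier],
    candidates: List[Sequence[float]],
    k: int = 1,
) -> List[int]:
    """Indices of up to k candidates on which the committee disagrees most.

    Returns an empty list when every candidate gets a unanimous vote,
    which corresponds to the committee having converged.
    """
    scored = []
    for i, features in enumerate(candidates):
        d = disagreement([clf(features) for clf in committee])
        if d > 0:
            scored.append((d, i))
    # Highest disagreement first; ties are broken by candidate order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


# Three threshold rules standing in for three learned decision trees.
committee = [lambda f: f[0] > 0.5, lambda f: f[0] > 0.7, lambda f: f[0] > 0.9]
candidates = [[0.95], [0.60], [0.20], [0.80]]
picked = most_informative(committee, candidates, k=2)
```

Here the committee is unanimous on the first and third candidates, so only the second and fourth (indices 1 and 3) are selected for labeling.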

The selected examples are then labeled using information retrieved
from the chosen secondary source. In a traditional active learning
process, this labeling requires a user or a domain expert. Apollo
removes this requirement by utilizing secondary sources in place of
a user. Once all the selected examples are labeled, Apollo re-learns
N decision trees by adding the newly labeled examples to the set of
labeled examples. This is done to obtain the next set of the most
informative examples. The process is repeated until all decision
trees converge.

Footnote 1: This is a manually selected value, and we are working on
a method to learn this value automatically.

It  should  be  noted  that  Apollo  is  not  limited  to 
using  one  secondary  source  for  labeling  examples.  If 
there  exist  multiple  useful  secondary  sources,  Apollo 
can  utilize  all  these  sources  to  label  the  informative 
examples.  This  is  further  explored  in  Section  7.
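
The overall loop of Section 4.2 can be sketched end to end under strong simplifying assumptions. A one-feature threshold rule (learn_stump) stands in for a decision tree learner, the loop is written iteratively rather than recursively as in Figure 2, and toy_source is a hypothetical stand-in for a secondary source; none of these names or choices come from Apollo.

```python
from typing import Callable, List, Tuple

Pair = Tuple[float, ...]            # similarity features of a candidate pair
Labeler = Callable[[Pair], bool]    # stands in for a secondary source
Rule = Callable[[Pair], bool]


def learn_stump(examples: List[Tuple[Pair, bool]]) -> Rule:
    """Stand-in for a decision tree learner: threshold the first feature
    midway between the mean match and mean non-match values."""
    pos = [x[0] for x, y in examples if y]
    neg = [x[0] for x, y in examples if not y]
    if not pos or not neg:
        constant = bool(pos)
        return lambda x: constant
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x[0] > cut


def learn(
    labeled: List[Tuple[Pair, bool]],
    candidates: List[Pair],
    source: Labeler,
    n: int = 3,
    max_rounds: int = 20,
) -> List[bool]:
    """Label disputed candidates with the source until the committee agrees."""
    labeled = list(labeled)
    for _ in range(max_rounds):
        # Split the labeled examples into n subsets, one per committee member.
        committee = [learn_stump(labeled[i::n]) for i in range(n)]
        seen = {x for x, _ in labeled}
        disputed = [
            c for c in candidates
            if c not in seen and 0 < sum(clf(c) for clf in committee) < n
        ]
        if not disputed:
            break  # no unlabeled candidate splits the committee
        # The secondary source, not a user, answers the queries.
        labeled.extend((c, source(c)) for c in disputed)
    committee = [learn_stump(labeled[i::n]) for i in range(n)]
    # Final label for each candidate is the committee's majority vote.
    return [2 * sum(clf(c) for clf in committee) > n for c in candidates]


seed = [((0.9,), True), ((0.1,), False), ((0.8,), True),
        ((0.2,), False), ((0.95,), True), ((0.05,), False)]
candidates = [(0.97,), (0.45,), (0.03,), (0.54,)]
queried: List[Pair] = []


def toy_source(pair: Pair) -> bool:
    queried.append(pair)
    return pair[0] > 0.5


labels = learn(seed, candidates, toy_source)
```

In this run only the two borderline candidates are sent to the toy source; the clear match and the clear non-match are classified without any query, which is the saving that active learning is meant to provide.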


5  Experimental  Evaluation 

We evaluated the idea of utilizing secondary sources to label record
pairs as matches or non-matches by performing four sets of
experiments. The goal of the experiments was to show that we can
achieve almost optimal performance from Apollo when performing
unsupervised learning using secondary sources for labeling. In all
sets of experiments we performed linkage across datasets in the
real-world restaurant domain.

We used the wrapper technology discussed in [13] to extract
restaurant records from the Zagat and Dinesite web sources. Each web
source provided a restaurant’s name, address, city, state, phone
number, and cuisine type. The Zagat data source contained 897
records, while the Dinesite data source contained 1257 records.
There were 136 matching records in the two datasets. Due to the
inconsistencies between the two sources, a record linkage system was
required to find common restaurants. Available secondary data
sources included a geocoder that provided geographic coordinates for
a given address, and a postal web site that provided a 9-digit
zipcode for a given address. Each set of experiments contained 10
runs, and the results shown are the average values for all runs.

In the first set of experiments we utilized the geocoder to generate
training examples. If the geocoder returned the same geographic
coordinates for the address of the two records in a record pair, the
record pair was labeled as a match. In the subsequent sets, we
performed the experiments using the postal service web site, a
random method, and user involvement to label record pairs. In the
random method, we randomly generated labels for the training
examples. The performance of the Apollo system with each secondary
source was compared to the performance of the Apollo system
requiring a user to label record pairs and to random labeling. The
results were measured using the F-measure formula, which combines
precision and recall measures as shown below.


\[
F\text{-measure} = \frac{2 \times \mathit{recall} \times \mathit{precision}}{\mathit{recall} + \mathit{precision}}
\]
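
As a small worked illustration of the F-measure, the sketch below computes it from raw counts, taking precision as the fraction of proposed links that are correct and recall as the fraction of true links that are found. The function name and the counts are hypothetical and are not results from the experiments.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when no link is correct."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * recall * precision / (recall + precision)


# Hypothetical counts: 9 correct links, 1 wrong link, 3 missed links.
score = f_measure(true_pos=9, false_pos=1, false_neg=3)
```

With these counts precision is 0.9 and recall is 0.75, giving an F-measure of about 0.82; because it is a harmonic mean, the score sits closer to the weaker of the two measures.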


Figure 3 shows that the Apollo system with user labeling performs
only slightly better than the Apollo system with secondary source
labeling. With both secondary sources, Apollo performs better than
the “strawman” approach of returning a random label when the system
asks the user to label a match.


Figure 3: Unsupervised Learning Using Secondary Sources

In  particular,  the  Apollo  system  with  the  geocoder 
secondary  source  reaches  an  optimal  F-measure  of 
93.88  with  150  labeled  examples.  The  Apollo  system 
with  user  labeling  reaches  an  optimal  F-measure  of 
97.22  with  100  labeled  examples.  However,  in  the  case 
of  the  Apollo  system  with  the  user  labeling,  the  user 
has  the  task  of  labeling  100  record  pairs,  while  in  the 
Apollo  system  with  automatic  unsupervised  labeling, 
the  user  does  not  need  to  label  any  of  these  pairs. 

The Apollo system with the postal service web site reaches an
F-measure of 85.71 with 140 labeled examples. Because different
restaurants can share the same 9-digit zipcode, the labels from this
source are sometimes inaccurate, and the system alternates between
learning strict and looser mapping rules. As a result, the results
oscillate between 100% precision and 100% recall. This oscillating
effect can be seen in the zipcode results in Figure 3.

While both secondary sources perform well, they perform worse than
the Apollo system with user-labeled examples. The key reason is the
inaccurate labeling mentioned earlier. In general, it would be
difficult to find one “golden” secondary source that can accurately
identify records as matches or non-matches. We can improve the
accuracy of the labeling process by combining information from
multiple secondary sources, or by combining secondary source
information with attributes of the primary sources to label record
pairs.

6  Related  Work 

There has been significant work done on solving the record linkage
problem [1, 2, 3, 6, 8, 12, 15, 17]. This work includes research on
entity matching [3, 8], object consolidation [8, 17], and
de-duplication [1, 6, 12, 15]. All these systems utilize some form
of textual similarity measure to determine if two records should be
linked. However, none of the systems incorporates the idea of
utilizing secondary sources to obtain relevant information and using
this information to improve the record linkage process.


Doan et al. [3] describe a profiler-based approach to improving
entity matching. The key idea in the paper is to design profilers by
mining large amounts of data from different web sources, obtaining
input from domain experts, or examining previously matched entities.
The profilers generate rules that determine relationships between
various attributes of entities; for example, someone with age 9 is
not likely to have a salary of $200,000. This idea is complementary
to our approach of utilizing secondary sources to provide additional
attributes.

To  the  best  of  our  knowledge,  our  approach  is  the 
first  to  perform  unsupervised  active  learning  for  record 
linkage.  Most  approaches  that  use  learning  in  the 
record  linkage  process  require  a  human  in  the  loop. 

7  Discussion 

In this article, we presented our approach to utilizing secondary
sources for unsupervised record linkage. We showed that a secondary
data source can be used to provide training data automatically and
free the user from the burden of labeling data. Needless to say, our
approach is only applicable in scenarios where viable secondary data
sources exist. However, as we pointed out earlier, for most data
sources on the Internet, there exist some secondary sources with
relevant information. In some enterprise settings, where one often
needs to work with obscure numbers, it may be harder to find
secondary data sources.

Our experimental evaluation shows that by utilizing secondary
sources to automatically provide training data to the record linkage
process, Apollo can dramatically reduce a user’s involvement while
maintaining the accuracy of the record linkage process. In the
future, we plan to investigate how Apollo can improve the usage of
secondary sources by utilizing combined weights of different
secondary data sources and various fields. For example, we can get
better labels if we utilize a geocoder in conjunction with cuisine
type or restaurant name. Finally, even though the transformations
used in Apollo are quite comprehensive, they do not cover all
possible sets of transformations. To address this problem, we are
working on improving the field (attribute) level matching process.
This work applies specific sets of transformations depending on the
semantic types of different attributes and leads to more accurate
confidence measures for the given attributes.